The goal of this project is to use unsupervised learning to identify categories of wine. Our dataset consists of 11 numerical physical-chemical measurements, which a Gaussian mixture model will use to identify distinct categories. We will not assume any particular number of clusters beforehand; instead, we will use the silhouette score to select the best number of clusters to segment our data by. Our dataset combines a red wine dataset and a white wine dataset, but we will not expose the color to our clustering algorithm. Instead, we are interested in whether the algorithm will naturally segment the wines into red and white categories.
The dataset can be found here: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
%matplotlib notebook
import numpy as np
import pandas as pd
import random
from IPython.display import display #for displaying dataframes nicely
import matplotlib.pyplot as plt
random.seed(137)
#combine red and white datasets
reds = pd.read_csv('winequality-red.csv', sep=';')  #the UCI files are semicolon-delimited
reds['type'] = 'red'
whites = pd.read_csv('winequality-white.csv', sep=';')
whites['type'] = 'white'
#drop the "quality" column as it is not a physical property that we are interested in
wines = pd.concat([reds, whites], ignore_index=True)
wines.drop(columns=['quality'], inplace=True)
print('rows, columns: {}'.format(wines.shape))
display(wines.head())
wines.describe()
As seen above, our values have very different ranges. We will need to rescale our data for optimal performance with typical machine learning algorithms. Before doing so, let's check for highly correlated features. If some features are found to be highly correlated, they will likely be redundant in the learning process and we can drop all but one of each group of highly correlated features.
#Use a heatmap to visualize the greatest correlations
plt.style.use('default')
plt.figure(figsize=(7, 7))
corr = wines.corr(numeric_only=True)  #exclude the non-numeric 'type' column
plt.imshow(corr, cmap='Reds', interpolation='nearest')
plt.xticks(np.arange(len(corr.columns)), corr.columns, fontsize=12, rotation=-60)
plt.yticks(np.arange(len(corr.columns)), corr.columns, fontsize=12)
plt.title('Heatmap of Correlations')
plt.tight_layout()
plt.show()
#View the numeric data corresponding to the above
wines.corr(numeric_only=True)
#Fetch top correlations
#Return a copy of dataframe with only metric columns
def drop_dims(df):
    df = df.copy()
    for col, dtype in zip(df.columns.values, df.dtypes):
        if dtype not in [np.float64, np.int64]:
            df.drop(col, inplace=True, axis=1)
    return df
#Every pair of metrics shows up twice
#This function removes one version of each pair (along with the diagonal, where every correlation equals 1.0)
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop
#return the highest correlations
def get_top_abs_correlations(df, n=3):
    df = drop_dims(df)
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]
print("Top Absolute Correlations\n")
display(get_top_abs_correlations(wines))
The highest correlation found was between free sulfur dioxide and total sulfur dioxide, which was 0.72. This is too low to justify dropping a column, especially for a small dataset which shouldn’t pose any significant performance concerns.
Let’s now look at the distributions of each metric.
#create a subplot showing a histogram for each metric
def show_metric_dist(df, title):
    plt.style.use('ggplot')
    plt.rcParams['figure.figsize'] = [10, 8]
    fig = plt.figure()
    plt.subplots_adjust(hspace=0.4)
    fig.suptitle(title, fontsize=20, y=0.98)
    j = 1
    for col, dtype in zip(df.columns.values, df.dtypes):
        if dtype not in [np.float64, np.int64]:
            continue
        ax = fig.add_subplot(3, 4, j)
        df[col].hist(bins=50, color='maroon')
        ax.set_title(col)
        j += 1
    plt.show()
show_metric_dist(wines, title='Metric Distributions (Before Normalizing & Scaling)')
We see that the distributions of many of these metrics are significantly skewed. In general, machine learning algorithms work best when metrics have a roughly Gaussian distribution, so we will need to normalize this data in addition to scaling it.
We first normalize the data using a Box-Cox transformation. A small shift is added manually to keep all values strictly positive, as suggested in the documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html
We then use MinMaxScaler to scale the data so that all metrics range from 0 to 1.
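To illustrate the effect before applying it to the wine data, here is a minimal, self-contained sketch on a synthetic right-skewed sample (not the wine dataset itself): Box-Cox pulls the distribution toward a Gaussian shape, and MinMaxScaler then maps it onto [0, 1].

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(137)
skewed = rng.exponential(scale=2.0, size=5000)  #strongly right-skewed sample

#Box-Cox requires strictly positive input; shift by a small epsilon as in the notebook
transformed, lam = stats.boxcox(skewed + 1e-8)

print('skew before: {:.2f}'.format(stats.skew(skewed)))
print('skew after:  {:.2f}'.format(stats.skew(transformed)))

#MinMaxScaler maps the transformed values onto [0, 1]
scaled = MinMaxScaler().fit_transform(transformed.reshape(-1, 1))
print('range: [{:.2f}, {:.2f}]'.format(scaled.min(), scaled.max()))
```

The skewness should drop from roughly 2 (typical of an exponential sample) to near 0 after the transform.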
#normalize with Box-Cox, then scale to [0, 1]
from sklearn.preprocessing import MinMaxScaler
from scipy import stats
#transform each metric toward a Gaussian shape
wines_mets = drop_dims(wines)
wines_scaled = wines_mets.apply(lambda x: stats.boxcox(x + 1e-8)[0], axis=0)
scaler = MinMaxScaler()
wines_norm = pd.DataFrame(scaler.fit_transform(wines_scaled), columns=wines_scaled.columns)
# Show an example of a record with scaling applied
display(wines_norm.describe())
show_metric_dist(wines_norm, title='Metric Distributions (After Normalizing & Scaling)')
As seen above, all metrics are now roughly gaussian and have a range from 0 to 1. This data is now optimal input for typical machine learning algorithms.
Before proceeding, we will perform a principal component analysis (PCA). PCA is a technique in which a set of orthogonal axes is chosen greedily, each oriented so that the variance along it is maximized. We will perform PCA in order to determine several things:
1) The minimum number of dimensions necessary to preserve the majority of the information (variance) contained in our data. This will be of interest to us if we decide to feed fewer dimensions to our ML algorithm in the interest of boosting performance.
2) Whether a 3D visualization will be sufficient to reveal some of the inherent structure in our data.
We will perform a PCA transformation using the same number of dimensions as our existing dataset. We will then calculate the percentage of the overall variance that can be encoded in 1, 2, ... n dimensions. Finally, we will examine the first component of our PCA-transformed data to identify which metrics account for the most variance.
# Apply PCA by fitting the data with the same number of dimensions as features
from sklearn.decomposition import PCA
pca = PCA(n_components=wines_norm.shape[1], random_state=51)
pca.fit(wines_norm)
#Transform wines_norm using the PCA fit above
pca_samples = pca.transform(wines_norm)
for i in range(wines_norm.shape[1]):
    first_n = pca.explained_variance_ratio_[0:i+1].sum()*100
    print('Percent variance explained by first {} components: {:.1f}%'.format(i+1, first_n))
print('\nFirst principal component contributions:\n')
first_comp= zip(wines_norm.columns.values, pca.components_[0])
for i, j in first_comp:
    print(i, '%.3f' % j)
We can see that the majority of the variance in our data (>95%) can be encoded in 8 of our 11 dimensions. We see that over 65% of our variance can be encoded in 3 dimensions. This suggests that we can expect to see some of the underlying structure in a 3D visualization, but much of it will still be hidden.
In the breakdown of the first principal component that follows, we see that residual sugar, alcohol, and total sulfur dioxide were assigned the largest coefficients, which indicates that these are the metrics with the greatest contribution to the variance of our data. This seems to agree with the transformed histogram distributions displayed earlier.
While we will use all 11 metrics in our clustering algorithm, we will first map our data to 3 dimensions using PCA and visualize it. We will return to this compressed dataset after assigning labels to all of our records, in order to visualize the results of our clustering efforts.
pca_3d = PCA(n_components=3, random_state=51)
pca_3d.fit(wines_norm)
#transform wines_norm using the PCA fit above
pca_samples_3d = pca_3d.transform(wines_norm)
#visualize our PCA transformed data
from mpl_toolkits.mplot3d import Axes3D
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(projection='3d')  #Axes3D(fig) no longer attaches to the figure in recent matplotlib
ax.scatter(pca_samples_3d[:,0], pca_samples_3d[:,1], pca_samples_3d[:,2], alpha=0.4, color= 'b')
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
plt.show()
If viewing in a jupyter notebook, feel free to rotate this plot. From visual inspection, it looks like there are two distinct groups of wine here (likely red and white). Considering that only ~65% of the data's variance is encoded in this 3D plot, we will take an analytical approach to choosing the number of clusters rather than assuming that 2 is the most reasonable number of groups.
This analysis will be done as follows:
1) We will fit our clustering algorithm using several different values of k where k is the number of clusters. We will have k range from 2 to 20.
2) For each value of k, we will evaluate the clustering results using the average silhouette score. We will plot the silhouette score against each k value and identify which number of clusters leads to the best results.
3) We will then assign cluster labels to our dataset using the fitted clustering model with the optimal number of clusters k.
The silhouette score can be roughly described as a measure of how close a sample is to members of its own cluster as compared to members of other clusters. The silhouette score ranges from -1 to 1. A score close to 1 indicates that a record is very close to other members of its cluster and far from members of other clusters. A score of 0 indicates that a record lies on the decision boundary between two clusters. A negative score indicates that a sample is closer to members of a cluster other than its own. By taking the average silhouette score for all records as various numbers of clusters are used in our clustering algorithm, we can find the number of clusters that best promotes cohesion within individual clusters and good separability between clusters.
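As a concrete illustration (a toy sketch with made-up points, not the wine data): a sample's silhouette is s = (b - a) / max(a, b), where a is its mean distance to its own cluster and b is its mean distance to the nearest other cluster. Labels that match two tight, well-separated groups score near 1, while labels that split those groups score much lower:

```python
import numpy as np
from sklearn.metrics import silhouette_score

#Two tight, well-separated toy clusters in 2D
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]])
labels_good = np.array([0, 0, 0, 1, 1, 1])  #matches the true grouping
labels_bad = np.array([0, 1, 0, 1, 0, 1])   #splits each tight group

print('good labels: {:.2f}'.format(silhouette_score(X, labels_good)))
print('bad labels:  {:.2f}'.format(silhouette_score(X, labels_bad)))
```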
For the clustering algorithm itself, a Gaussian mixture model (GMM) was chosen. This was done for several reasons, including the fact that GMMs allow for mixed membership: they assign probabilities that a given record belongs to a given cluster. This property may be useful for classifying wines which are blends of multiple types; for example, a wine may be a blend of Cabernet Sauvignon and Merlot.
Another reason for choosing a GMM is that it is more flexible with regard to cluster shapes that deviate from a hyper-sphere. It is impossible for us to observe the actual cluster shapes directly since they exist in an 11-dimensional space, so it is helpful to have a clustering algorithm with such flexibility.
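The mixed-membership property can be seen directly via `predict_proba`. A minimal sketch on synthetic 1-D data (not the wine metrics): a point midway between two overlapping populations gets an ambiguous soft assignment, while a point deep inside one population gets a confident one.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(51)
#Two overlapping 1-D Gaussian populations centered at 0 and 4
X = np.concatenate([rng.normal(0.0, 1.0, 500),
                    rng.normal(4.0, 1.0, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=51, n_init=5).fit(X)

#Hard assignment vs. soft membership probabilities
print('hard label for x=2.0:', gmm.predict([[2.0]])[0])
print('membership for x=2.0:', gmm.predict_proba([[2.0]]).round(2))
print('membership for x=-3.0:', gmm.predict_proba([[-3.0]]).round(2))
```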
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
max_k=20
sil_scores=[]
for i in range(2, max_k+1):
    clusterer = GaussianMixture(n_components=i, random_state=51, n_init=5)
    clusterer.fit(wines_norm)
    #Predict the cluster for each data point
    preds = clusterer.predict(wines_norm)
    #Find the cluster centers
    centers = clusterer.means_
    #Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(wines_norm, preds)
    sil_scores.append(score)
sil_scores= pd.Series(sil_scores, index= range(2,max_k+1))
max_score= sil_scores.max()
n_clusters= sil_scores.idxmax()
print('Max Silhouette Score: {:.3f}'.format(max_score))
print('Number of clusters: {}\n'.format(n_clusters))
print('First 3 Silhouette Scores')
print(sil_scores[0:3])
#refit the model to the K with the max silhouette score
clusterer = GaussianMixture(n_components=n_clusters, random_state=51, n_init=5)
clusterer.fit(wines_norm)
#Predict the cluster for each data point
preds = clusterer.predict(wines_norm)
#Find the cluster centers
centers = clusterer.means_
plt.style.use('ggplot')
plt.figure(figsize=(10,8))
plt.title('Silhouette Score vs. Number of Clusters', fontsize=14)
plt.ylabel('Silhouette Score')
plt.xlabel('Number of Clusters')
plt.xticks(np.arange(2, max_k+1, 1))
plt.plot(sil_scores.index.values, sil_scores)
We see that the silhouette score is maximized when 3 clusters are used; therefore, we fit the GMM using 3 clusters. We will now color each of our clusters and visualize the data in 3 dimensions again. We will also append the wine type (red vs. white) back to our dataset to see how this property is distributed among the clusters.
#append cluster labels
pca_3d_clusters= np.append(pca_samples_3d, preds.reshape(-1, 1), axis=1)
#append wine type (red, white)
pca_3d_clusters= np.append(pca_3d_clusters, np.asarray(wines['type']).reshape(-1, 1), axis=1)
from mpl_toolkits.mplot3d import Axes3D
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(projection='3d')
mapping = {0:'r', 1:'c', 2:'b'}
colors = [mapping[x] for x in preds]
ax.scatter(pca_3d_clusters[:,0], pca_3d_clusters[:,1], pca_3d_clusters[:,2], alpha=0.4, color= colors, marker= 'o')
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
plt.show()
We see that our clusters are fairly well separated in 3D space. Now we will plot each of these clusters individually, coloring points red for red wines and yellow for white wines.
from mpl_toolkits.mplot3d import Axes3D
plt.style.use('ggplot')
plt.rcParams['figure.figsize']= [10, 3.5]
fig = plt.figure()
for i in range(3):
    ax = fig.add_subplot(1, 3, i+1, projection='3d')
    #np.append coerced the array to strings when the wine type was added, so cast the cluster column back to float
    cluster_subset = pca_3d_clusters[pca_3d_clusters[:,3].astype(float)==i]
    type_colors = np.where(cluster_subset[:,4]=='red', 'r', 'y')
    ax.scatter(cluster_subset[:,0], cluster_subset[:,1], cluster_subset[:,2], alpha=0.4, color=type_colors, marker='o')
    ax.set_title('Cluster {}'.format(i))
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_zticklabels([])
plt.tight_layout(pad=2.0)
plt.show()
We see that the first cluster is almost entirely composed of red wines, while the other two clusters are almost entirely composed of white wines. It appears that our clustering algorithm has recognized the distinction between red and white wines and has also identified two distinct categories of white wines. Let's now look at some descriptive statistics for each cluster.
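The qualitative impression above can be quantified with a row-normalized contingency table of cluster label versus wine type. A minimal sketch with hypothetical labels standing in for `preds` and `wines['type']` (the values below are made up for illustration):

```python
import pandas as pd

#Hypothetical cluster labels and wine types standing in for preds and wines['type']
labels = pd.Series([0, 0, 0, 1, 1, 2, 2, 2], name='cluster')
types = pd.Series(['red', 'red', 'white', 'white', 'white',
                   'white', 'white', 'red'], name='type')

#Row-normalized crosstab: the fraction of each cluster made up of each wine type
print(pd.crosstab(labels, types, normalize='index').round(2))
```

Applied to the real data, each row of the table would show how purely red or white a cluster is.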
wines['cluster'] = preds  #use the integer predictions directly; pca_3d_clusters was coerced to strings
for i in range(3):
    print('Cluster {}'.format(i))
    subset = wines[wines['cluster']==i]
    display(subset.describe())
Based on the results above, it looks like the main quality distinguishing red wines from white wines is the sulfur dioxide content. This makes sense, since red wines generally contain fewer sulfites than white wines. As for the two categories of white wines, the main distinguishing quality appears to be residual sugar. Cluster 1, with its higher residual sugar content, is likely composed of sweeter wines such as Rieslings and ice wines. Cluster 2, with its lower residual sugar content, is likely composed of drier wines such as Chardonnay and Muscadet.